Accelerating language emergence by functional pressures

In language emergence, neural agents acquire communication skills by interacting with one another and with the environment. Through these interactions, agents learn to connect, or ground, their observations to the messages they utter, forming a shared consensus about the meaning of the messages. Such connections form what we refer to as a grounding map. However, these maps are often complicated and unstructured, and they can contain redundant connections. In this paper, we introduce two novel functional pressures, modeled as differentiable auxiliary losses, to simplify and structure the grounding maps. The first pressure enforces compositionality via topological similarity, which has been discussed previously but has not been modeled or used as a differentiable auxiliary loss. The second functional pressure, which is conceptually novel, imposes sparsity on the grounding map by pruning weaker connections while strengthening stronger ones. We conduct experiments in multiple value-attribute environments with varying communication channels. Our methods achieve improved out-of-domain regularization and faster convergence than baseline approaches. Furthermore, the introduced functional pressures are robust to changes in experimental conditions and can operate with minimal training data. We note that functional pressures lead to simpler and more structured emergent languages, which show distinct characteristics depending on the functional pressure employed. Enhancing grounding map sparsity yields the best performance and the languages with the most compressible grammar. In summary, our novel functional pressures, focusing on compositionality and sparse groundings, expedite the development of simpler, more structured languages while enhancing their generalization capabilities. Exploring alternative types of functional pressures and combining them in agent training may be beneficial in the ongoing quest for improved emergent languages.


S2 Appendix. Agent architecture
Agents are modeled as recurrent neural networks using LSTMs [1], as these are frequently used in the language emergence literature [2][3][4][5][6]. First, a one-hot encoded input x from a value-attribute environment is passed through a single feed-forward layer FC1 and a batch normalization layer BN, and the result is used as both the initial cell state c_0 and the initial hidden state h_0 of the Speaker's LSTM cell. The initial input s_0 is initialized to zeros and is fed to the LSTM cell via a set of embeddings.
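To make the input format concrete, the following minimal sketch builds a one-hot encoded input x for a value-attribute object. It assumes that x is the concatenation of one one-hot block per attribute and that all attributes share the same number of values; the text does not state either detail, and the function name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(values, n_values):
    """Concatenate one one-hot block per attribute (assumed layout).

    values   -- list of attribute values, each in range(n_values)
    n_values -- number of possible values per attribute (assumed to be
                the same for every attribute in this sketch)
    """
    x = []
    for v in values:
        if not 0 <= v < n_values:
            raise ValueError(f"value {v} outside range(0, {n_values})")
        block = [0.0] * n_values
        block[v] = 1.0
        x.extend(block)
    return x


# Hypothetical object with 3 attributes and 4 possible values each.
x = one_hot_encode([2, 0, 3], n_values=4)
print(len(x))  # 12, i.e. |x| = n_attributes * n_values
```

Under this assumed layout, the input dimension |x| grows with both the number of attributes and the number of values per attribute.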
The output hidden state h_1 of the LSTM cell is forwarded through the linear layer FC2 to sample a symbol s_1 ∈ S using Gumbel-softmax sampling [7]. As the input for the next time step, the sampled symbol s_1 is fed back to the LSTM cell through the same set of embeddings. This process is repeated until the number of sampled symbols equals the message length T. Sampling then stops, and the message m ∈ M, formed by concatenating the discrete symbols s_1, ..., s_T, is forwarded to the Listener.
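The sampling loop described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: `step_fn` is a hypothetical stand-in for the embedding, LSTM cell, and FC2 layers, the temperature `tau` and the seed are arbitrary, and the sketch has no gradients, so it only illustrates the forward sampling.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return (soft, hard): relaxed probabilities and the one-hot argmax."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(z - m) for z in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


def generate_message(step_fn, state, vocab_size, T, tau=1.0, seed=0):
    """Sample T symbols, feeding each sampled symbol back as the next input."""
    rng = random.Random(seed)
    prev = [0.0] * vocab_size  # s_0 is initialized to zeros
    message = []
    for _ in range(T):
        # step_fn stands in for Embedding -> LSTM cell -> FC2 (hypothetical).
        logits, state = step_fn(prev, state)
        _, prev = gumbel_softmax(logits, tau, rng)
        message.append(prev.index(1.0))
    return message


def toy_step(prev_symbol, state):
    # Hypothetical stand-in: fixed logits, state used as a step counter.
    return [0.0, 1.0, 2.0, 0.5, -1.0], state + 1


msg = generate_message(toy_step, state=0, vocab_size=5, T=4)
print(msg)  # four symbol indices, each in range(5)
```

In a differentiable framework, the relaxed vector `soft` (or a straight-through combination of `soft` and `hard`) would carry the gradient back to the Speaker; which variant was used is not specified in the text.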
An LSTM layer inside the Listener first consumes the entire message, and its final hidden state is fed to a single linear layer FC3. The unnormalized log probabilities output by FC3 are taken as a representation of the reconstruction x′ and are used with the cross-entropy to calculate the reconstruction loss.
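As an illustration of how a cross-entropy reconstruction loss can be computed from unnormalized outputs, the sketch below applies a log-softmax to each attribute block of the FC3 output and sums the negative log-likelihoods of the true values. The per-attribute split and the summation are assumptions made for this example; the text only states that the FC3 outputs are used with the cross-entropy.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of unnormalized scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def reconstruction_loss(logits, target_values, n_values):
    """Sum of per-attribute cross-entropies (assumed decomposition).

    logits        -- flat list of length n_attributes * n_values (FC3 output)
    target_values -- true value index of each attribute
    n_values      -- number of possible values per attribute
    """
    if len(logits) != len(target_values) * n_values:
        raise ValueError("logits length must equal n_attributes * n_values")
    loss = 0.0
    for a, target in enumerate(target_values):
        block = logits[a * n_values:(a + 1) * n_values]
        loss -= log_softmax(block)[target]
    return loss


# Uninformative Listener: all logits equal, 2 attributes with 4 values each.
uniform = reconstruction_loss([0.0] * 8, [1, 3], n_values=4)
print(round(uniform, 4))  # 2.7726, i.e. 2 * ln(4)

# A Listener that scores the correct values highly incurs a smaller loss.
confident = reconstruction_loss(
    [0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0], [1, 3], n_values=4
)
print(confident < uniform)  # True
```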
In all our experiments, we maintain a constant hidden state size of 500 for both the Speaker and the Listener. However, the dimensions of the fully connected layers vary to accommodate different value-attribute datasets and channel capacities.

Fig 1
Fig 1 shows the architecture of the Speaker and the Listener. The fully connected layers have the following dimensions: FC1: |x| → 500, FC2: 500 → S, FC3: 500 → |x|.
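The dependence of the layer dimensions on the environment and the channel can be made explicit with a small helper. This sketch assumes that |x| equals the number of attributes times the number of values per attribute, and it reads S in "FC2: 500 → S" as the number of symbols in the vocabulary; both readings are assumptions, and `layer_shapes` is a hypothetical name.

```python
HIDDEN_SIZE = 500  # hidden state size stated in the text


def layer_shapes(n_attributes, n_values, vocab_size):
    """Return (in_features, out_features) for FC1, FC2 and FC3."""
    # |x| under an assumed concatenated one-hot encoding.
    x_dim = n_attributes * n_values
    return {
        "FC1": (x_dim, HIDDEN_SIZE),       # input -> initial Speaker state
        "FC2": (HIDDEN_SIZE, vocab_size),  # Speaker state -> symbol logits
        "FC3": (HIDDEN_SIZE, x_dim),       # Listener state -> reconstruction
    }


# Hypothetical environment: 2 attributes, 10 values each, 50 symbols.
shapes = layer_shapes(n_attributes=2, n_values=10, vocab_size=50)
print(shapes["FC1"], shapes["FC2"], shapes["FC3"])  # (20, 500) (500, 50) (500, 20)
```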